ABSTRACT

Three  stages  of  visual  processing  determine  how  internal  noise  appears  to  an  external  observer:  light 
adaptation,  contrast  gain  control,  and  a  postsensory /decision  stage.  Dark  noise  occurs  prior  to  adaptation, 
determines  dark-adapted  absolute  thresholds,  and  mimics  stationary  external  noise.  Sensory  noise  occurs 
after  dark  adaptation,  determines  contrast  thresholds  for  sine  gratings  and  similar  stimuli,  and  mimics 
external  noise  that  increases  with  mean  luminance.  Postsensory  noise  incorporates  perceptual,  decision, 
and  mnemonic  processes.  It  occurs  after  contrast-gain  control  and  mimics  external  noise  that  increases 
with  stimulus  contrast  (i.e.,  multiplicative  noise).  Dark  noise  and  sensory  noise  are  frequency  specific  and 
primarily affect weak signals. Only postsensory noise significantly affects strong signals, and it has
constant power over a wide spatial frequency range in which sensory noise varies enormously.

Two  parallel  perceptual  regimes  jointly  serve  human  object  recognition  and  motion  perception:  a 
first-order  linear  (Fourier)  regime  that  computes  relations  directly  from  luminances,  and  a  second-order 
nonlinear  (nonFourier)  rectifying  regime  that  uses  the  absolute  value  of  detector  outputs.  When  objects  or 
movements  are  defined  by  high  spatial  frequencies  (i.e.,  texture  carrier  frequencies  whose  wavelengths  are 
small  compared  to  the  object  size),  the  responses  of  high-frequency  receptors  are  demodulated  by 
rectification  to  facilitate  discrimination  at  the  next  hierarchical  processing  levels.  Rectification  sacrifices 
the  statistical  efficiency  (noise  resistance)  of  the  first-order  regime  for  efficiency  of  connectivity  and 
computation.

Three Stages and Two Systems of Visual Processing
George  Sperling 

Human  Information  Processing  Laboratory,  Department  of  Psychology, 

New  York  University,  New  York,  NY  10003 

Bandpass filtering refers to the processing of images or sounds so that they contain only a narrow
range (typically one or two octaves) of component frequencies. In audition, bandpass filtering is used to
create  stimuli  that  stimulate  only  a  small  portion  of  the  basilar  membrane.  By  studying  psychophysical 
responses  to  stimuli  filtered  in  different  bands,  information  processing  mediated  by  each  portion  of  the 
basilar  membrane  can  be  studied. 

In  vision,  the  aim  of  bandpass  filtering  is  to  create  stimuli  that  stimulate  only  one  or  a  small  number 
of  the  visual  channels  that  operate  in  parallel  to  process  visual  stimuli.  Ideally,  stimuli  filtered  in  high 
frequency  bands  would  stimulate  only  receptors  (channels)  with  small  receptive  fields.  Stimuli  filtered  in 
low frequency bands would stimulate only channels that have large receptive fields. (The term channel is
used  here  to  designate  an  information  processing  system  characterized  by  receptors  of  a  particular  size.)  As 
in  audition,  there  is  substantial  interest  not  only  in  how  stimuli  that  are  confined  to  a  single  band  are 
processed,  but  also  in  how  information  from  stimuli  in  different  bands  is  perceptually  combined. 

With  the  advent  of  affordable  graphics  processors,  bandpass  filtering  has  become  an  increasingly 
widespread stimulus manipulation in vision. Working with bandpassed stimuli brings to the fore some
important issues that are the subject of this article. With hindsight, we see, as usual, that some of these
issues have been confronted before, but the advent of bandpassed stimuli offers important new insights.
In  other  cases,  new  stimuli  and  procedures  raise  new  questions  and  offer  new  opportunities.  This  paper 
coordinates  data  that  have  emerged  from  paradigms  that  utilize  bandpass  filtered  stimuli  together  with  a 
variety  of  other  data  in  order  to  arrive  at  some  general  principles  of  sensory  information  processing. 

1.  Visual  Noise  at  Three  Stages  of  Processing 

Consider first a study that was originally designed to determine whether image spatial frequencies or
object spatial frequencies were critical for object discrimination. Parish and Sperling (1987a, 1987b)
filtered individual capital letters in five different spatial frequency bands (Fig. 1). They studied the role of
three  factors  in  the  ability  of  subjects  to  identify  these  letters  when  they  were  embedded  in  noise:  (1)  the 
signal-to-noise  ratio,  (2)  the  object-relative  spatial  frequency  band  in  which  the  letters-plus-noise  were 
filtered,  and  (3)  the  viewing  distance  (which  determined  retinal  spatial  frequency).  They  found  that 
identification accuracy was independent of viewing distance over a range of more than 30:1. In this wide
range,  retinal  spatial  frequency  did  not  matter  in  determining  recognition  accuracy;  only  object  spatial 
frequency  mattered.  On  the  other  hand,  visual  sensitivity  to  sine  gratings  at  threshold  varies  enormously 
within  the  same  range  of  retinal  frequencies.  In  this  section,  we  examine  sine-grating  detection  and  letter 
discrimination in order to define the various sources of noise that limit visual performance.

Additive and Multiplicative Noise

We consider two kinds of noise: additive noise and multiplicative noise. The term additive noise is
used here to denote a stationary noise source that is independent of the signal and is added to the signal.
Additive noise can be overcome by increasing signal strength until the effective signal-to-noise ratio is
sufficient to support the desired level of performance.


Multiplicative  noise  is  proportional  to  the  signal,  that  is,  it  multiplies  the  signal.  For  example,  in  a 
binary (dark-grey/light-grey) image, reversing the contrast of (multiplying by -1) a randomly chosen 10
percent  of  the  pixels  would  be  a  form  of  multiplicative  noise.  Increasing  the  intensity  or  the  contrast  of  the 
image  would  not  alter  its  signal-to-noise  ratio.  Multiplicative  noise  is  equivalent  to  adding  a  noise  whose 
expected  power  is  proportional  to  signal  power.  Several  authors  have  noted  the  distinction  between 
additive and multiplicative noise (e.g., Carlson & Klopfenstein, 1985; Legge & Foley, 1980; Legge,
Kersten, & Burgess, 1987; Pavel, Sperling, Riedl, & Vanderbeek, 1987). Loss of information that results
from too-sparse sampling of the stimulus also can be regarded as a multiplicative noise because it cannot
be overcome by increasing signal strength (e.g., Legge et al., 1987).

Multiplicative noise cannot be responsible for detection or discrimination thresholds that are reached
by  reducing  the  strength  of  a  signal.  Because  multiplicative  noise  declines  proportionately  with 
diminishing  signal  strength,  weak  signals  are  not  worse  off  than  strong  signals.  Because  sufficiently  low- 
luminance  or  sufficiently  low-contrast  visual  signals  are  not  visible,  we  infer  that  the  internal  noise  that 
limits  vision  at  low  contrasts  is  better  represented  as  additive  rather  than  as  multiplicative  noise. 
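
The contrast between the two kinds of noise can be made concrete with a small numerical sketch. It is an illustration only, not part of the cited studies: the function names and numeric values are hypothetical, additive noise is modeled as a fixed RMS amplitude, and multiplicative noise as an RMS amplitude equal to a fixed fraction of the signal amplitude.

```python
def snr_additive(signal_amplitude, noise_rms):
    """s/n when the noise is stationary and independent of the signal."""
    return signal_amplitude / noise_rms


def snr_multiplicative(signal_amplitude, noise_fraction):
    """s/n when the noise RMS amplitude is a fixed fraction of the signal."""
    noise_rms = noise_fraction * signal_amplitude
    return signal_amplitude / noise_rms


if __name__ == "__main__":
    # Hypothetical values: additive noise RMS 0.05, multiplicative fraction 0.2.
    for amplitude in (0.01, 0.1, 1.0):
        print(amplitude,
              snr_additive(amplitude, 0.05),
              snr_multiplicative(amplitude, 0.2))
```

Under these assumptions, the additive-noise s/n falls toward zero as the signal is weakened, whereas the multiplicative-noise s/n is the same at every signal strength, which is why multiplicative noise alone cannot produce a threshold.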

There  are  many  visual  discrimination  tasks  in  which  increasing  stimulus  luminance  or  contrast  does 
not  improve  performance.  Consider  five  examples.  In  attempting  to  detect  a  spatial  sine  wave  grating 
embedded in noise, the contrast of the display has no effect on performance once a critical contrast is
reached (Pelli, 1981). In detecting spatial amplitude modulation of a one-dimensional spatial noise, once
about eight times the contrast threshold for the noise is reached, further increases in overall contrast do not
make the modulation more detectable (Jamar, Campagne, & Koenderink, 1982; Jamar & Koenderink,
1985).  In  Parish  &  Sperling’s  (1987b)  letter-in-noise  discrimination  task,  only  the  signal-to-noise  ratio 
matters  (Note  1).  In  discriminating  direction  of  motion,  once  a  contrast  of  about  0.05  is  reached,  further 
increases  in  contrast  do  not  improve  performance  (Nakayama  &  Silverman,  1985).  In  audition,  similar 
kinds of results in which only the signal-to-noise ratio (and not absolute loudness) matters are the norm.
For  example,  when  a  noisy  radio  broadcast  is  loud  enough  to  be  distinctly  heard,  making  it  louder  does  not 
make it more intelligible. The visual analog, the independence of the discriminability of noisy, dynamic
visual signals from the stimulus contrast at which they were viewed, was verified over a 4:1 range of
contrasts by Pavel, Sperling, Riedl, and Vanderbeek (1987). In such cases, human performance appears to
be  characterized  by  multiplicative  noise. 

From a theoretical point of view, it is important to note that systems that appear upon external
examination to have identical multiplicative noise may have vastly different internal mechanisms for
generating their behavior. Viewed externally, the internal operation of multiplying the noise by a factor k
before adding it to the signal is equivalent to the internal operation of dividing the signal by k before
adding it to the noise. Both result in the same internal signal-to-noise ratio s/n. The equivalence of
dividing signals by k and multiplying noise by k suggests gain control as a physiologically plausible
internal mechanism to mimic multiplicative noise: The gain control multiplies input signals by 1/k before a
constant-power internal noise is added.
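
Written out, with s and n the signal and noise amplitudes and k the factor named above, the equivalence is a one-line identity. The constant a in the second line is a symbol introduced here only for illustration: it assumes, as a simplification, that the gain control sets k in proportion to signal strength.

```latex
\begin{align*}
\frac{s}{k\,n} &= \frac{s/k}{n}, \\
k = a\,s \;&\Longrightarrow\; \frac{s/k}{n} = \frac{1}{a\,n}.
\end{align*}
```

In the second line the internal s/n no longer depends on s, which is the signature of multiplicative noise.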

Three  Sources  of  Visual  Noise. 

To  understand  how  internal  noise  sources  appear  when  viewed  from  the  outside,  it  is  useful  to 
consider  three  stages  of  visual  processing:  light  adaptation,  contrast  gain  control,  and  decision.  Figure  2 
illustrates  a  flow  chart  for  the  computations  carried  out  by  these  early  stages.  The  particular  mechanisms 
indicated  in  Fig.  2  for  light  adaptation  and  for  contrast-gain  control  are  based  on  physiologically  plausible 
principles.  They  are  vastly  oversimplified  and  serve  to  illustrate  the  functional  principles  of  the  processes 
of light adaptation and gain control rather than the precise details (cf. Shapley & Enroth-Cugell, 1986). For
example, the flow chart omits the division of signals into two distinct pathways that carry only positive
and  only  negative  signals  (the  on-center  and  the  off-center  neurons),  parallel  spatial  frequency  channels  are 
not  explicitly  treated,  there  is  no  gain  control  for  W2,  and  so  on. 


Insert Fig. 2 here.


Three  stationary  noise  sources  are  illustrated  in  Fig.  2;  each  has  constant  expected  power  and  an 
unchanging  frequency  spectrum.  The  three  stages  at  which  noise  is  added  are  (1)  directly  at  the  input,  (2) 
after  light  adaptation,  and  (3)  after  contrast-gain  control. 

Dark  Noise. 

In  absolute  darkness,  the  spontaneous  activity  of  the  visual  receptors,  rods  and  cones,  is  represented 
as  dark  noise  (Barlow,  1956,  1957).  Dark  noise  is  prior  to  any  processes  responsible  for  light  adaptation. 
To  be  reliably  detected  against  a  totally  dark  background,  a  signal  must  exceed  not  only  the  level  of  dark 
noise but also the combined level of all noise in the visual pathways. However, it would be expected that,
through evolution, absolute threshold would be determined primarily by dark noise. That is, for receptors
to serve most efficiently, their amplification gain would have increased (through evolution) up to the point
where the receptor noise itself was the limiting factor.

Sensory Noise.

Sensory  noise  is  the  limiting  noise  in  the  detection  of  weak  signals  against  uniform  backgrounds. 
For  example,  by  definition,  a  spatial  sine  wave  grating  with  a  contrast  of  0.0001  has  an  absolute  modulation 
that  is  proportional  to  its  mean  luminance.  The  brighter  the  illumination,  the  greater  its  absolute 
modulation. If there were no sensory noise, then increasing the absolute modulation of a spatial sinewave
grating  by  increasing  its  mean  luminance  at  constant  contrast  ultimately  would  increase  its  absolute 
modulation  to  the  point  of  visibility  (even  with  quantal  noise  in  the  stimulus).  However,  at  high 
luminances,  grating  stimuli  are  visible  very  nearly  in  proportion  to  their  contrast,  not  to  their  absolute 
modulation.  The  essential  fact  of  sensory  noise  is  that,  when  viewed  from  outside  the  system  at  moderate 
to  high  intensities,  its  apparent  power  increases  with  absolute  modulation  rather  than  remaining  constant. 
To  model  noise  that  apparently  increases  with  the  mean  luminance  (background  luminance),  the  sensory 
noise  source  is  placed  after  (central  to)  the  gain  control  that  modulates  visual  responsiveness  as  a  function 
of  intensity.  Constant  sensory  noise,  placed  after  the  gain-control  mechanism,  mimics  an  external  additive 
noise  that  increases  as  a  function  of  background  intensity. 

Sensory noise, Weber's law, quantal fluctuations. Weber's law asserts that the minimum detectable
increment in intensity ΔS increases in direct proportion to the background intensity S on which it is
superimposed; at threshold, ΔS/S = k, a constant. Assume that, at threshold, a constant signal-to-noise
ratio is required at the detector itself: s/n = signal amplitude / root-mean-square (RMS) noise amplitude.
Indeed,  the  effective  signal-to-noise  ratio  of  the  stimulus  at  the  detector  is  equivalent  to  the  d'  statistic  of 
signal  detection  theory  (Green  &  Swets,  1966).  Internal  noise  after  adaptation  to  the  background  is 
equivalent to external noise whose RMS power increases in proportion to the background intensity: either
results  in  Weber’s  law  behavior  because,  to  maintain  a  constant  signal-to-noise  ratio,  the  threshold 
increment  would  have  to  increase  in  direct  proportion  to  the  mean  background.  Thus  sensory  noise  is 
hypothesized  to  be  the  source  of  Weber’s  law. 
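
A minimal numerical sketch of this argument follows. It assumes an idealized adaptation stage that divides its input by the background intensity and a constant noise source added afterward; the function name, the parameter names noise_rms and criterion_snr, and all numeric values are hypothetical and are not taken from the text.

```python
def threshold_increment(background, noise_rms=0.01, criterion_snr=1.0):
    """Smallest increment dS whose internal s/n reaches the criterion.

    Assumed model: the adaptation stage has gain 1/background, so the
    internal signal is dS / background, and constant noise of RMS
    amplitude noise_rms is added after adaptation. Solving
    (dS / background) / noise_rms = criterion_snr for dS gives the result.
    """
    return criterion_snr * noise_rms * background


if __name__ == "__main__":
    for background in (1.0, 10.0, 100.0, 1000.0):
        d_s = threshold_increment(background)
        # The last column is the Weber fraction dS / S.
        print(background, d_s, d_s / background)
```

In this sketch the threshold increment grows in proportion to the background while the ratio of increment to background stays fixed, which is the behavior that Weber's law describes.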

Most  visual  stimuli  are  produced  by  sources  that  can,  for  practical  purposes,  be  approximated  as 
quantal  emitters.  This  means  that,  even  with  a  nominally  constant  stimulus,  the  number  of  quanta  collected 
by  the  retina  in  any  given  area  varies  from  occasion  to  occasion  and  is  characterized  by  a  Poisson 
distribution. The variance of the Poisson distribution is equal to its mean; therefore, the RMS power of
quantum noise increases in direct proportion to the square root of the luminance of visual stimuli. Because
quantal  noise  increases  with  the  stimulus  amplitude,  it  usually  is  considered  in  conjunction  with  sensory 
noise. 
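
In symbols, writing N for the number of quanta collected on a given occasion and \bar{N} for its mean (notation introduced here only for illustration), the Poisson property gives:

```latex
\[
\operatorname{Var}(N) = \bar{N}, \qquad
\sigma_N = \sqrt{\bar{N}}, \qquad
\frac{\sigma_N}{\bar{N}} = \frac{1}{\sqrt{\bar{N}}}.
\]
```

The absolute fluctuation therefore grows with luminance, while the fluctuation relative to the mean shrinks.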

The  full  analysis  of  quantal  noise  in  the  stimulus  itself  together  with  such  factors  as  the  blur  of  the 
visual  optics  and  the  spacing  of  retinal  cone  receptors  is  quite  complex.  For  example,  Banks,  Geisler  & 
Bennett  (1988)  and  Geisler  (1989)  applied  such  an  analysis  to  contrast  detection  thresholds  for  sine  wave 
stimuli of spatial frequencies from 5 to 40 cycles per degree, at mean luminances from 3.4 to 340 cd/m²
(10.7  to  1068  trolands).  Stimuli  at  each  frequency  consisted  of  seven  sine  cycles;  i.e.,  the  stimuli  of 
different spatial frequencies were scaled replicas of each other. Once all the preneural factors cited
above had been taken into account, Banks et al. found that, at the observed thresholds, the stimulus s/n at
the detector was constant. The most parsimonious interpretation is that sensory noise is negligible
compared  to  quantal  stimulus  noise  for  their  stimuli.  For  sine  wave  gratings  at  lower  spatial  frequencies 
than  5  cpd  and  for  more  intense  stimuli  at  all  spatial  frequencies,  sensory  noise  becomes  quite  significant 
relative  to  quantal  noise.  At  very  low  levels  of  background  luminance,  dark  noise  becomes  important 
(Geisler,  1989).  Indeed,  a  model  such  as  that  of  Fig.  2,  together  with  threshold  data  obtained  at  different 
adaptation  levels,  offers  a  clear  distinction  between  and  independent  estimates  of  residual  sensory  noise 
and  dark  noise. 

Postsensory  Noise. 

In  the  superthreshold  experiments  with  added  external  noise  discussed  above,  detection  depended 
only on the signal-to-noise ratio s/n and not on the contrast at which these signals-plus-noise were
viewed.  In  terms  of  a  model,  the  dependence  of  objective  performance  measures  (such  as  direction-of- 
motion  judgments,  intelligibility  scores,  letter  discriminations)  on  s/n  and  their  independence  of  stimulus 
contrast  is  represented  by  a  contrast-gain  control  that  equates  all  signals  that  exceed  a  minimal  contrast 
level.  For  example,  the  input/output  function  illustrated  in  the  contrast-gain  control  box  of  Fig.  2  is  shaped 
like  a  logistic  function  with  an  asymptotic  output  of  -1  for  large  negative  contrasts,  and  an  asymptotic 
output  of +1  for  large  positive  contrasts.  A  constant  noise  source  that  was  located  centrally  to  (added  after) 
such  a  gain  control  would  appear  to  an  external  observer  to  be  equivalent  to  an  external  noise  source  that 
was directly proportional to contrast in those ranges of input where the gain-control was near its
asymptotes.
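
The following sketch illustrates the argument under stated assumptions. The contrast-gain control is modeled as tanh(contrast / c0), a function chosen here only because it is logistic-shaped with asymptotes at -1 and +1; it is not claimed to be the function drawn in Fig. 2. The external noise is assumed to scale with the stimulus, and the postsensory noise is a constant RMS amplitude. The names c0, external_snr, and postsensory_rms, and all numeric values, are hypothetical.

```python
import math


def internal_snr(contrast, external_snr=2.0, postsensory_rms=0.2, c0=0.05):
    """Internal s/n after an assumed tanh-shaped contrast-gain control.

    The signal and the external noise are both scaled by the same gain,
    so their ratio (external_snr) is preserved; a constant postsensory
    noise is then added in quadrature.
    """
    signal_out = math.tanh(contrast / c0)            # saturates near 1
    external_noise_out = signal_out / external_snr   # scaled with the signal
    total_noise = math.hypot(external_noise_out, postsensory_rms)
    return signal_out / total_noise


if __name__ == "__main__":
    for contrast in (0.005, 0.02, 0.1, 0.5, 1.0):
        print(contrast, internal_snr(contrast))
```

Once the gain control is near its asymptote, the internal s/n in this sketch stops changing with contrast, so the constant postsensory noise behaves, from the outside, like a noise proportional to contrast. At low contrasts the same noise source lowers the s/n, as an additive noise would.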

From  a  functional  point  of  view,  all  noise  sources  that  are  added  after  contrast-gain  control  will 
appear  externally  to  be  multiplicative  noises,  proportional  to  stimulus  contrast.  There  are  many  such 
sources. Consider a two-alternative forced-choice intensity discrimination task. In successive intervals, an
observer is presented with, for example, sounds of intensity 40 and 41 dB and required to say which
interval contained the louder sound. Generally, observers do better in a pure block of trials (only two
sounds, 40 and 41 dB, occur) than in a mixed block (e.g., trials with 40 and 41 dB mixed in with trials
containing 60 and 61 dB, a "roving" discrimination; Berliner & Durlach, 1973). In the pure block, the
inability of the human observer to equal the performance of the ideal observer is attributed to a
combination of (human) sensory and decision noise. In the mixed block, there is additional "context" noise
due to an attentional/mnemonic component. In an identification task, where observers must name each
stimulus (e.g., 40, 41, 60, 61 dB), their performance can be characterized as being further degraded by
mnemonic noise.

The  relative  levels  of  performance  in  any  two  complex  detection,  discrimination,  or  identification 
tasks  will  be  determined  by  a  combination  of  shared  noise  sources  and  task-specific  noise  sources  (e.g., 
MacMillan,  1987).  All  these  postsensory  noise  sources  are  grouped  together  under  the  heading  of 
postsensory  noise,  representing  perceptual,  contextual,  decision,  attentional,  mnemonic,  and  response 
processes  that,  according  to  the  task,  add  noise  after  contrast-gain  control. 

To  recapitulate:  In  vision,  at  threshold,  sensitivity  is  governed  by  the  intrinsic  additive  noise  of  the 
visual system (Pelli, 1981). Above threshold, matters apparently are quite different: "The notion of the
observer's equivalent noise, which has been so useful in understanding detection, is found not to be
relevant at suprathreshold contrasts." (Pelli, 1981, p. 121). However, to formulate coherent theories of
performance,  we  need  merely  to  enlarge  the  concept  of  equivalent  noise  to  include  noise  sources  that,  to  an 
external  observer,  appear  to  vary  with  adaptation  (because  they  are  located  after  adaptation  gain  control) 
and  noise  sources  that  appear  to  vary  with  stimulus  contrast  (because  they  are  located  after  contrast-gain 
control). 

The Efficiency of Detection.

The efficiency eff of detection or of discrimination is the ratio of s_i/n_i required by an ideal observer
to the ratio s_h/n_h required by a human observer at the same criterion level c of performance:

$$ \mathit{eff} = \frac{s_i / n_i}{s_h / n_h} $$

For example, in a visual display of n independent, equivalently informative
pixels, eff is the fraction of the n pixels that the ideal observer needs to observe in order to match human
performance.
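
As a worked illustration of the definition, the sketch below takes the ratio literally as stated in the text. The function name and the example values are hypothetical and are not data from any of the cited experiments.

```python
def efficiency(ideal_snr, human_snr):
    """Ratio of the s/n an ideal observer requires to the s/n a human
    observer requires at the same criterion level of performance."""
    if human_snr <= 0:
        raise ValueError("human_snr must be positive")
    return ideal_snr / human_snr


if __name__ == "__main__":
    # Hypothetical example: the ideal observer reaches criterion at s/n = 0.4,
    # the human observer at s/n = 1.0.
    print(efficiency(0.4, 1.0))
```

Because the ideal observer never needs a higher s/n than the human observer at the same criterion, the value lies between 0 and 1.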


Experimental  determinations  of  efficiency  establish  an  upper  bound  on  the  power  of  the  human 
internal  noise  sources.  Parish  &  Sperling  (1987a)  determined  the  efficiency  of  human  discrimination  in 
identifying  visual  letters  masked  by  noise.  When  both  the  letters  and  noise  were  passed  through  a  filter 
centered at 1.05 cycles per letter height, efficiency exceeded 0.40. Furthermore, this high efficiency was
observed  over  a  30:1  range  of  viewing  distances.  At  the  different  viewing  distances,  these  stimuli  are 
transduced  by  visual  channels  characterized  by  vastly  different  retinal  spatial  frequencies.  The  constant 
high  efficiency  suggests  that  information  loss  in  the  visual  pathway  before  the  point  of  postsensory  noise 
was  negligible.  In  terms  of  noise  sources,  this  means  that  dark  noise  and  sensory  noise  were  negligible 
compared to stimulus noise, and the postsensory noise was of the same order of power as the real stimulus
noise.  Over  the  enormous  range  of  spatial  frequencies  subserved  by  these  channels,  efficiency  was 
determined  primarily  by  postsensory  noise. 


2.  Letter  Discrimination,  Noise,  and  the  Spatial  Modulation  Transfer  Function  (MTF) 


The MTF, also called the contrast transfer function, is the function that gives the contrast modulation
of  a  sinewave  grating  at  its  threshold  of  detection  as  a  function  of  its  spatial  frequency  (Fig.  3).  For  the 
sine  waves  that  fall  in  the  range  of  retinal  spatial  frequencies  investigated  in  Parish  &  Sperling’s  (1987a) 
letter  detection  experiments,  contrast  threshold  ranges  from  a  minimum  of  0.002  at  5  -  8  cpd  to  a  maximum 
of 0.7 at 37 cpd (e.g., van Nes & Bouman, 1967; cf. Fig. 3). The frequency of 37 cpd is the mean retinal
frequency of Parish & Sperling's highest frequency band b5 at their longest viewing distance. The most
detectable retinal frequencies (5-8 cpd) are produced by frequency bands b3, b4, and b5 (Fig. 1) at closer
viewing distances. In all these letter-discrimination conditions, observed discrimination efficiency was
independent of the mean retinal frequency, whereas threshold sensitivity for sinusoidal gratings varies from
0.7 to 0.002, a factor of 35, within this frequency range (Fig. 3). Indeed, the combination of filter
frequency with viewing distance produces retinal frequencies that vary over a range of more than 200:1.
Figure  3  illustrates  the  division  of  spatial  frequencies  into  three  regions: 

(1) The top region, which represents invisible sinusoidal gratings: their contrast is below detection
threshold. 

(2)  A  middle  region,  indicated  in  grey,  in  which  detection  is  governed  by  quantal  and  sensory  noise. 
In  this  region,  increasing  stimulus  contrast  improves  performance. 

(3)  The  lower  region  in  which  postsensory  noise  predominates.  Here,  noise  is  proportional  to 
contrast  so  performance  is  independent  of  contrast.  The  numbers  indicate  the  center  frequency  (projection 
on  the  x-axis)  of  various  bandpass  stimulus  conditions  of  the  Parish-Sperling  study,  and  the  approximate 
contrast  level  (0.1,  projection  on  the  y-axis)  at  which  performance  becomes  independent  of  stimulus 
contrast. 

Previously, Jamar & Koenderink (1985) had noted an apparent independence of spatial frequency in
the detection of amplitude modulated noise gratings. They investigated a relatively small range of
frequencies and did not determine the efficiency of detection. In letter detection, the enormous range of
frequency invariance and the extremely low level of decision noise (as demonstrated by comparison with
ideal detectors) are truly astounding.


Insert Fig. 3 here.


Detection  thresholds  for  sine  gratings  vary  enormously  with  retinal  spatial  frequency  in  precisely  the 
same range of frequencies where discrimination threshold for letters-in-noise is constant. The difference
between the two experiments is readily interpreted in terms of the levels-of-noise model. The grating
detection experiment is limited by quantum noise in the stimulus and by sensory noise; the letter-in-noise
discrimination experiment is limited by postsensory noise. Whereas letter-in-noise discrimination is
unaffected by stimulus contrast over a wide range, stimulus contrast is the dependent variable in the grating
detection experiment. Indeed, the grating detection experiment can be viewed as indicating the effective
power of quantal plus sensory noise as a function of spatial frequency. We say "effective power" because
there is no provision in the simple stage model for input amplification that may vary as a function of spatial
frequency: input gain is incorporated into sensory noise.

Conclusions. Representing grating detection and letter-in-noise discrimination as noise-limited
processes yields the following conclusions: (1) Sine grating detection at low stimulus contrasts is limited
by quantal noise (Banks et al., 1988) and by sensory noise (Pelli, 1981), each of which varies little with
stimulus contrast but varies greatly with retinal spatial frequency and with mean luminance. (2) Letter
detection at stimulus contrasts greater than about 0.10 is limited by apparently multiplicative noise that is
proportional to stimulus contrast but is independent of spatial frequency (for a 100-fold range of retinal
spatial frequencies). (3) When letters are discriminated in external noise which deliberately is not
negligible, the effective internal noise apparently varies multiplicatively with stimulus contrast. These
empirical  relationships  follow  from  the  stage  model  of  Fig.  2;  and  they  are  illustrated  in  Fig.  3.  (4)  Sensory 
and  postsensory  noise  are  independent  and  vary  differently  with  spatial  frequency.  For  example,  the 
channel  that  transduces  5-8  cpd  has  the  lowest  sensory  noise,  but  it  has  the  same  decision  noise  as  the 
channel  that  processes  37  cpd,  which  has  35  times  more  sensory  noise. 

Analogous  phenomena  in  psychoacoustics.  A  similar  pattern  of  strong  frequency  dependence  of 
threshold  detection  and  frequency  independence  of  high-intensity  discrimination  occurs  in 
psychoacoustics. For example, absolute intensity-detection thresholds ΔI(f) for sinusoidal pressure
waveforms vary enormously as a function of frequency f. At high signal levels, detection thresholds for
sinusoidal increments ΔI(f)/I hardly vary with frequency (Robinson & Dadson, 1956; Riesz, 1928;
Jesteadt, Wier, & Green, 1977; see Scharf & Buus, 1986, for a review). Detection limits at low input levels
are quite different from discrimination limits at high input levels. The nature of these differences is
dictated  by  requirements  of  having  maximally  sensitive  receptors  and  of  operating  over  an  enormous 
dynamic  range.  Since  these  problems  are  shared  by  many  modalities,  we  should  not  be  surprised  at 
functionally  similar  solutions. 

Advantage  of  above-threshold  gain  that  is  independent  of  frequency.  A  visual  object  is  characterized 
by  relations  between  its  component  spatial  frequencies.  When  the  object  is  viewed  from  nearer  or  further, 
these relations do not change; they are merely transposed up or down the frequency axis. If the visual
system had important gain differences between different spatial frequencies, then these differences would
have to be incorporated into object descriptors in order to preserve object invariance with scale changes.
Clearly, object description at a high level can be more economical when the low level description
accurately represents the object's spatial frequency content. Constant gain across frequencies is the
simplest  way  to  begin  a  scale-invariant  description. 

An  auditory  object  (e.g.,  a  voice  or  a  tune)  is  characterized  by  the  relations  between  the  component 
auditory  frequencies.  To  a  first  order,  moving  closer  or  further  corresponds  to  an  overall  intensity  change. 
If  changing  intensity  changed  the  internal  frequency  relations,  the  internal  object  descriptor  would  have  to 
be intensity (distance) dependent. For sounds near threshold, their internal representation will, to some
extent, inevitably reflect the ear's sensitivity. Above threshold, it would be desirable for object
descriptions to be intensity invariant. Indeed, the further above threshold, the less loudness varies as a
function  of  frequency. 

3.  Why  You  Can’t  See  the  Forest  for  the  Trees:  The  Economics  of  Connectivity 

In  audition,  frequencies  above  20,000  Hz  are  too  high  to  be  audible  and  amplification  will  not  make 
them  audible.  In  vision,  the  opposite  occurs.  For  example,  in  Parish  &  Sperling’s  (1987a)  letter 
discrimination  task  with  2-octave-wide  bands,  the  higher  the  center  frequency  of  the  band,  the  more 
discriminable the letters. Is there an upper visual object frequency at which the trend to improved
discriminability  reverses?  What  happens  as  visual  stimuli  are  filtered  in  higher  and  higher  spatial 
frequency  bands? 


Insert  4  here. 


Consider  first  an  ideal  letter  stimulus  in  which  the  letter  has  perfect  zero/one  step  edges  (Fig.  4b). 
When  such  an  edge  is  bandpass  filtered,  for  example,  by  a  difference  of  Gaussians  filter  (Fig.  4a),  the 
results  are  alternate  dark/light  stripes  centered  at  the  edge  (Fig.  4c).  When  the  frequency  spectra  of 
different  filter  bands  are  related  by  simple  translation  on  a  log  frequency  axis  (i.e.,  the  frequency  filters 
differ  only  in  scale),  the  basic  shape  of  the  bandpass  filtered  edge  is  independent  of  the  center  frequency  of 
the filter. By suitably scaling the abscissa, we arrive at a canonical representation like that of Fig. 4b and
4c in which, as the frequency band changes, only the distance between edges changes, not the shape of the
edge  representation. 

Obviously,  the  higher  the  filter’s  center  frequency,  the  narrower  the  edge  representation.  But  spatial 
bandpass filters operate on object, not retinal, frequencies. As the frequency of a band is made higher and
higher, the viewer can preserve a constant retinal frequency by approaching the letter closer and closer,
ultimately, with a microscope. Locally, the edges filtered at different center frequencies f will produce
exactly the same retinal images when the viewing distance is proportional to 1/f. What changes with
viewing  distance  is  the  retinal  distance  between  opposite  edges  of  a  letter  stroke.  In  normal  viewing,  the 
thickness  of  a  letter  stroke  may  be  a  few  minutes  of  arc.  With  sufficient  magnification,  the  width  of  the 
letter  stroke  ultimately  comes  to  occupy  degrees  or  even  hundreds  of  degrees  of  visual  angle.  For 
extremely  high  frequency  bands,  when  a  letter  has  been  enlarged  sufficiently  to  make  its  edges  visible,  it  is 
physically  impossible  to  view  the  whole  letter  at  one  time.  This  is  the  classical  problem  of  trying  to  read  a 
newspaper  under  a  high  power  microscope.  Even  individual  letters  become  unrecognizable. 
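
The scaling argument can be illustrated numerically. The following is a minimal sketch, not a model of any experiment: it assumes a hypothetical 1 mm letter stroke viewed from 500 mm at a reference band (relative frequency 1), and it scales the viewing distance as 1/f so that the filtered edge keeps the same retinal image. The function names and all numbers are assumptions chosen for illustration, and the calculation uses plain geometry with no optics.

```python
import math

def visual_angle_deg(size_mm: float, distance_mm: float) -> float:
    """Visual angle (degrees) subtended by a flat object of the given size."""
    return math.degrees(2.0 * math.atan(size_mm / (2.0 * distance_mm)))

def stroke_angle_for_band(rel_freq: float,
                          stroke_mm: float = 1.0,
                          ref_distance_mm: float = 500.0) -> float:
    """Retinal width of a letter stroke when viewing distance scales as 1/f.

    rel_freq is the band's center frequency relative to the reference band
    (an assumed, unitless quantity); the defaults are hypothetical.
    """
    distance_mm = ref_distance_mm / rel_freq
    return visual_angle_deg(stroke_mm, distance_mm)

# The edge image is held constant by construction; only the stroke width grows.
for rel_freq in (1, 10, 100, 1000):
    print(rel_freq, round(stroke_angle_for_band(rel_freq), 3))
```

Under these assumptions the stroke grows from a few minutes of arc at the reference band to a large fraction of the visual field at the highest band, while the local edge image is unchanged.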

For the cases of letters printed with real ink on real paper, or illuminated letters on real CRT screens,
the scaling problem is similar to the case of ideal letters. High spatial frequencies represent local texture
information about the ink and paper or about how the CRT screen is populated with phosphor: local texture
that obscures the larger landscape. This is the problem of being unable to see the forest for the trees
(Sperling  &  Parish,  1985).  The  scale  of  observation  is  inappropriate  for  the  object  being  observed. 

Economy  of  connection.  What  is  the  appropriate  scale  of  observation?  This  is  dictated  by  the 
principle  of  economy  of  connection.  To  compute  relations,  sensors  must  be  connected  to  each  other.  It  is 
uneconomical  for  every  sensor  in  a  large  field  to  be  connected  with  and  to  compute  its  relations  to  every 
other  sensor;  typically  sensors  are  connected  only  to  immediate  neighbors  and  to  nearby  neighbors.  A 
sensor  and  its  similar  neighbors  form  a  kind  of  module.  The  size  of  the  visual  receptive  field  viewed  by  a 
module  is  inversely  related  to  the  module’s  characteristic  spatial  frequency.  In  this  arrangement,  the 
optimal  scale  of  observation  is  when  the  object  is  of  the  same  order  of  size  as  the  receptive  field  of  the 
sensors  so  that  the  object  can  be  entirely  described  within  a  module.  Indeed,  the  spatial  frequency  band 
that is most efficient for letter recognition is one cycle per object, i.e., the same order of size as the object.
The size relation between letters and the spatial frequencies that were empirically found to be most
efficient in identifying them is illustrated in Fig. 5. Note that several such spatial frequency filters, in
different  orientations  and  phases,  would  be  required  to  discriminate  between  the  26  upper-case  letters. 


Insert  5  here. 


When, on the other hand, much smaller sensors are used to describe a large object, then, in a
hierarchically organized system, this requires communication between modules, communication that occurs
only at higher levels. Empirically, using high spatial frequencies to describe large objects results in a loss
of perceptual efficiency. Below, we consider some reasons why communication between modules might
entail  a  loss  of  information. 


4.  Two  Processing  Systems 

The  basic  thesis  of  this  section  is  that  there  are  two  processing  systems:  a  Fourier  system  that  uses 
phase  information  and  makes  local  computations  within  a  small  local  area  (a  module);  and  a  nonFourier 
system  that  discards  phase  information  and  coordinates  computations  made  in  different  modules.  We 
approach  these  general  issues  by  considering  an  analogy  from  radio  communication. 

Demodulation. 

High  frequency  carriers.  In  AM  (amplitude  modulated)  radio  communication,  the  amplitude  of  a 
high frequency carrier wave is modulated by the voice frequencies that are to be transmitted. Voice
frequencies  of  up  to  about  10,000  Hz  are  transmitted  as  amplitude  modulations  of  a  100,000  Hz  carrier 
frequency.  The  process  of  extracting  the  low-frequency  modulating  signal  from  the  high-frequency  carrier 
frequency  is  demodulation. 

In visual object recognition, an analogous process of modulation occurs when an object A, whose
overall shape is, by definition, characterized by frequencies around one cycle per object, is differentiated
from its surround by higher frequencies. This would occur if the object had a surface texture that differed
from the background texture. In that case, a spatial filter tuned to one of the dominant spatial frequencies
in A, say f_a, would record a large response wherever A was present, and smaller responses elsewhere.
Another object, B, might contain an intermediate amount of f_a but larger amounts of other spatial
frequencies (Fig. 5b). Textures function, like colors, to characterize objects.


Insert  6  here. 


There  is  a  perfect  analogy  between  a  characteristic  texture  frequency  and  an  AM  carrier  frequency. 
The goal of demodulation is the same in both instances. In AM radio, demodulation means
estimating  how  much  carrier  signal  (its  amplitude)  is  present  at  each  instant  in  time.  In  a  texture-defined 
object,  the  problem  for  the  visual  system  is  estimating  how  much  carrier  signal  is  present  at  each  point  in 
space. 

A  simple  form  of  demodulation  involves  fullwave  rectification  (taking  the  absolute  value)  of  the 
signal  (Fig.  5d).  The  modulated  carrier  is  rectified  and  then  lowpass  filtered  (Fig.  5e  and  f)  to  remove  the 
carrier  and  higher  frequencies;  only  the  original  modulating  signal  remains.  In  the  visual  system,  after  the 
initial  receptors,  positive  and  negative  signals  are  carried  in  separate  channels  (for  example,  on-center, 
off-center  neurons).  An  alternative  method  of  transmitting  positive  and  negative  quantities  is  to  modulate 
the  resting  firing  rate  of  a  neuron  up  and  down.  The  advantage  of  using  separate  positive  and  negative 
channels  is  that  zero  signal  means  zero  impulses  per  second,  and  so  the  average  firing  rate  is  minimized. 
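
The rectify-then-lowpass scheme can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the circuit of any receiver or of the visual system: the carrier period, the envelope, the boxcar lowpass filter, and all function names are hypothetical choices. The rescaling by pi/2 reflects the fact that the mean of a fullwave rectified sinusoid is 2/pi of its amplitude.

```python
import math

def boxcar_lowpass(signal, width):
    """Crude lowpass filter: running mean over `width` samples (clipped at the ends)."""
    out = []
    for i in range(len(signal)):
        lo = max(0, i - width // 2)
        hi = min(len(signal), i - width // 2 + width)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

n = 2000
carrier_period = 20      # samples per carrier cycle (assumed)
envelope_period = 1000   # samples per cycle of the modulating signal (assumed)

envelope = [1.0 + 0.5 * math.sin(2 * math.pi * i / envelope_period) for i in range(n)]
carrier = [math.sin(2 * math.pi * i / carrier_period) for i in range(n)]
modulated = [e * c for e, c in zip(envelope, carrier)]

# Demodulation: fullwave rectification followed by lowpass filtering.
rectified = [abs(v) for v in modulated]
recovered = [v * math.pi / 2 for v in boxcar_lowpass(rectified, 2 * carrier_period)]

# For comparison: lowpass filtering WITHOUT rectification loses the envelope.
linear_only = boxcar_lowpass(modulated, 2 * carrier_period)

interior = range(100, n - 100)
max_error = max(abs(recovered[i] - envelope[i]) for i in interior)
linear_peak = max(abs(linear_only[i]) for i in interior)
print(max_error, linear_peak)
```

In this sketch the rectified and filtered signal tracks the envelope closely, whereas the purely linear lowpass output stays near zero: the modulating signal is invisible to a linear lowpass filter until the carrier has been rectified.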


Insert  7  here. 


When there are separate on- and off-channels, to preserve the sign of the signal at subsequent
synapses,  the  target  synapses  for  on-  and  off-  neurons  must  operate  in  opposite  directions  (excitation  or 
inhibition).  Fullwave  rectification  is  accomplished  when  the  target  synapses  of  the  on-  and  off-channels 
operate  in  the  same  direction  (see  Fig.  7).  In  terms  of  the  high-frequency  sensors  of  the  carrier  frequency, 
fullwave  rectification  means  that  the  sensors  communicate  information  about  their  location  and  the 
magnitude  (but  not  the  sign)  of  their  responses  to  the  next  higher  level  of  the  system.  On  the  other  hand, 
halfwave  rectification  corresponds  to  independent  analyses  of  the  on-  and  off-channel  signals,  a  process 
that has been proposed as a mechanism for locating luminance boundaries (Watt & Morgan, 1985).
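
The relation between on/off channels and the two kinds of rectification can be written out explicitly. The sketch below is a schematic illustration only; the function names and the sample values are hypothetical. Each channel carries a nonnegative quantity, and what differs is how a target combines them.

```python
def on_channel(x: float) -> float:
    """Positive part of the signal (responds to increments)."""
    return max(x, 0.0)

def off_channel(x: float) -> float:
    """Negative part of the signal, carried as a positive quantity."""
    return max(-x, 0.0)

signal = [-1.0, -0.25, 0.0, 0.5, 2.0]   # arbitrary signed contrasts
on = [on_channel(v) for v in signal]
off = [off_channel(v) for v in signal]

# Target synapses of opposite sign (excite from on, inhibit from off):
# the signed signal is preserved.
sign_preserving = [p - q for p, q in zip(on, off)]

# Target synapses of the same sign: fullwave rectification (absolute value).
fullwave = [p + q for p, q in zip(on, off)]

# Halfwave rectification: each channel is analyzed on its own.
halfwave_on, halfwave_off = on, off

print(sign_preserving)
print(fullwave)
```

Here `sign_preserving` reproduces the input, while `fullwave` equals its absolute value, so the choice between linear and rectified transmission reduces to the sign of one set of synapses.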

Converting  the  output  of  high  frequency  detectors  to  lower  frequencies  (demodulation)  is  a  critical 
component  of  object  recognition  because  objects  are  defined  most  efficiently  and  most  economically  in  the 
lowest  feasible  frequency  range.  The  computational  advantage  of  a  hierarchical  demodulator  scheme  is 
that  pattern  recognition  at  the  higher  level  can  use  a  single  computation  that  is  independent  of  the  scale  or 
the  contrast  of  the  sensors  that  are  transmitting  information  from  lower  levels.  Because,  in  this  context, 
demodulation  involves  going  from  higher  to  lower  spatial  frequencies,  the  pattern  recognition  algorithm 
can  operate  at  the  lowest  frequency.  Using  the  lowest  possible  frequencies  is  computationally  efficient 
because  of  the  economy  of  connection:  A  neuron  and  its  immediate  neighbors  span  the  field  of  interest. 

In  letter  discrimination,  the  experimentally  measured  efficiency  of  discrimination  was  highest 
(eff = 0.4) at 1 cycle/object, the lowest usable band of spatial frequencies. Efficiency decreased to 0.1 at
10 cycles/object. Informational inefficiency is an unavoidable consequence of rectification because a
computation  that  discards  the  sign  of  the  input  cannot  be  as  efficient  as  one  that  takes  sign  into  account. 
However,  statistical  inefficiency  is  a  consequence  of,  not  direct  evidence  for,  demodulation  or  rectification. 
For  direct  evidence,  we  turn  to  other  paradigms. 

Direct  evidence  for  two  computational  regimes  in  motion  perception. 

Perhaps  the  most  convincing  way  to  demonstrate  two  computational  systems  is  to  embed  two 
conflicting  cues,  one  aimed  at  each  system,  in  the  same  stimulus.  The  best  examples  occur  in  the  domain 
of motion stimuli. The image of a moving stimulus is a three-dimensional (3D) function that gives
luminance l(x,y,t) as a function of x,y,t. To represent this 3D function on a printed page, we use (x,t)
cross-sections that omit the y dimension, as illustrated in Fig. 8a and b. Figure 8a shows a frame-by-frame
representation of a rightward moving black bar; Fig. 8b shows the corresponding x,t cross-section.
Superimposed on the bar's x,t cross-section in Fig. 8b is a sinewave. This sinewave is the x,t cross-section
of a sinusoidal grating that is moving at the same velocity as the bar. This particular moving grating
represents one of the largest Fourier components of the moving bar.


Insert  8  here. 


Figure 8c shows a space-time representation of a motion stimulus that has conflicting cues: a
contrast-reversing  bar,  based  on  Anstis’s  (1970)  reversed  phi  phenomenon.  The  bar  steps  sideways  across 
a  gray  field,  alternating  its  contrast  between  black  (-1)  and  white  (+1)  on  each  step. 

In the x,t cross-section, the bar moving to the right appears as a contrast-reversing diagonal slanting
to the right. However, the Fourier sinewave components of the contrast-reversing bar are slanted down and
to the left, indicating Fourier motion to the left. Rectifying the contrast-reversing bar, i.e., taking the
absolute value of its contrast, yields the stimulus of Fig. 8b, whose Fourier sinewave components are
slanted downward to the right, indicating rightward motion.
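
The conflict between the two cues can be reproduced with a toy computation. The sketch below is a deliberately crude stand-in for first-order motion analysis: it assumes a one-pixel bar that steps one pixel per frame, and an elementary Reichardt-style correlator that compares neighboring pixels across successive frames. The function names, the stimulus size, and the correlator spacing are all hypothetical choices, not the detectors used in the cited studies.

```python
def contrast_reversing_bar(n_steps):
    """x,t cross-section: in frame t a one-pixel bar sits at x = t and its
    contrast alternates +1, -1, +1, ... on a gray (0) background."""
    frames = []
    for t in range(n_steps):
        row = [0.0] * n_steps
        row[t] = 1.0 if t % 2 == 0 else -1.0
        frames.append(row)
    return frames

def rectify(frames):
    """Fullwave rectification: absolute value of contrast at every point."""
    return [[abs(v) for v in row] for row in frames]

def net_rightward(frames):
    """Rightward minus leftward output of an elementary correlator
    (one-pixel spacing, one-frame delay), summed over the stimulus."""
    total = 0.0
    for now, nxt in zip(frames, frames[1:]):
        for x in range(len(now) - 1):
            total += now[x] * nxt[x + 1]   # pairing consistent with rightward motion
            total -= now[x + 1] * nxt[x]   # pairing consistent with leftward motion
    return total

stimulus = contrast_reversing_bar(8)
print(net_rightward(stimulus))            # negative: correlator signals leftward
print(net_rightward(rectify(stimulus)))   # positive: correlator signals rightward
```

With these assumptions the correlator output changes sign when the stimulus is rectified, which mirrors the leftward first-order and rightward second-order cues described above.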

When the contrast-reversing bar is viewed from near, it seems obviously to move to the right.
However,  when  it  is  viewed  peripherally,  or  from  a  distance,  or  at  very  low  contrast,  it  apparently  moves  to 
the  left  (Chubb  &  Sperling,  1988a,  1989b).  This  clearly  indicates  that  observers  make  two  different  kinds 
of  motion  computations. 

For  motion  stimuli,  Chubb  and  Sperling  arrive  at  a  functional  discrimination  between  first-order 
(Fourier, direct) and second-order (nonFourier, rectified) processes. (Note 2.) They refer to the first-order
regime  as  a  Fourier  process  because  it  is  well  modeled  by  linear  filters  that  utilize  the  Fourier 
decomposition  of  the  stimulus.  The  second-order  system,  which  involves  rectification,  operates  better  over 
larger  retinal  distances  than  does  the  Fourier  system.  Consistent  with  the  lower  efficiency  of  rectification, 
the  second-order  system  has  higher  contrast  thresholds  than  the  Fourier  system.  Certain  values  of  the 
parameters of viewing (such as small retinal size, peripheral retinal location, and low stimulus contrast)
increase  the  relative  strength  of  the  first-order  versus  the  second-order  computation. 

While  the  contrast-reversing  bar  is  a  simple  demonstration  stimulus,  it  does  not  enable  one  to 
discriminate  between  different  second-order  computations.  Chubb  &  Sperling  (1989b)  demonstrate  a 
sideways  stepping,  contrast-reversing  grating,  a  stimulus  which  displays  obvious  second-order  motion  and 
in which halfwave rectification, alone or in combination with any reasonable temporal transformation, can
be excluded. In displays that were designed to exclude fullwave rectification and admit only halfwave
rectification or Fourier motion analysis, Sperling and Chubb (1987) found second-order motion so weak
that it did not preclude alternative explanations. Thus, the predominant mechanism of second-order
motion perception involves fullwave rectification. Fullwave rectification also is the dominant
mechanism  in  second-order  texture-orientation  processing  of  the  x,y  patterns  that  represented  the  x,t 
cross-sections  of  the  motion  stimuli  in  their  motion  experiments  (Note  3). 

In  motion  perception,  there  is  a  well-established  distinction  between  short-range  and  long-range 
motion  processes  (e.g.,  Braddick,  1974;  Pantle  &  Picciano,  1976;  Westheimer  &  McKee,  1977;  Victor  & 
Conte, 1989b). The inadequacy of first-order motion processing has been amply documented by
Ramachandran, Madhusudhan Rao, and Vidyasagar (1973), Sperling (1976), Lelkens & Koenderink (1984),
Pantle & Turano (1986), and Victor & Conte (1989a). The properties adduced for the non-first-order
system  are  generally  those  described  above,  plus  a  relative  insensitivity  to  the  eye  of  origin  of  successive 
stroboscopic  stimuli.  To  these  can  be  added  the  observations  of  Dosher,  Landy,  &  Sperling  (in  press)  and 
Landy,  Sperling,  Dosher,  &  Perkins  (1987)  that  first-order  motion  supports  the  kinetic  depth  effect  (KDE, 
Wallach  &  O’Connell,  1953)  whereby  3D  structure  is  perceived  in  2D  moving  stimuli,  whereas  KDE 
induced by second-order motion stimuli is weak and of enormously lower resolution (e.g., Prazdny,
1987). 

The  computations  of  first-order  motion  are  well  embodied  in  the  quite  similar  models  of  Watson  and 
Ahumada  (1983),  van  Santen  and  Sperling  (1984),  and  Adelson  and  Bergen  (1985),  which  van  Santen  & 
Sperling  (1984,  1985)  supplement  with  surprising  predictions  of  first-order  relationships  that  are  verified 
experimentally.  What  Chubb  &  Sperling  (1988b,  1989a,  1989b)  have  added  is  a  computational 
specification  of  a  second-order  motion  system  together  with  methods  for  producing  stimuli  that  can  be 
proved  to  be  directly  aimed  at  one  or  the  other  system.  As  a  consequence,  it  is  easily  shown  that  (retinal) 
short-range  and  (retinal)  long-range  are  inadequate  system  descriptions  because  there  is  a  broad 
intermediate  range  in  which  both  computations  operate. 

Orientation  and  motion  perception.  Strong  evidence  for  two  computational  regimes  is  obtained  in 
studies of orientation detection in textured patterns as well as in studies of direction discrimination in
one-dimensional motion perception. Indeed, these two problems involve formally identical computations (van
Santen & Sperling, 1985; Chubb & Sperling, 1987, 1988b). Figures 8d and 8e show demonstrations of
stimuli that show obvious apparent motion (when presented as motion stimuli) and obvious slant
(orientation) when presented in x,y, as in the illustration.

The stimuli of Fig. 8d and 8e are driftbalanced; that is, they are exemplars of random stimuli in
which the expected motion (or orientation) is exactly equal for every pair of oppositely-directed component
Fourier frequencies (Chubb & Sperling, 1988b). (The overlaid sine gratings of Figs. 8b and 8c are an
example of two oppositely-directed Fourier components: their slants in the x,t cross-sections are equal
and opposite.) The stimuli of 8d and 8e are microbalanced. This means, roughly, that every little area,
whatever its shape, in these stimuli is driftbalanced. Thus, the obvious orientation in these x,y stimuli is
invisible to every linear Hubel-Wiesel cell (i.e., neurons with receptive fields such as illustrated in Fig. 5).
The motion in the x,t versions of the stimuli is invisible to any standard (Fourier or Reichardt-equivalent)
motion detector. Rectification is required to make the x,t motion or x,y orientation in these stimuli
accessible to standard motion or orientation analysis (Chubb and Sperling, 1988).
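
A toy example may help convey what it means for the expected first-order motion signal of a random stimulus to vanish. The sketch below is an assumption-laden illustration, not a reconstruction of the stimuli in Fig. 8d and 8e and not a test of the formal driftbalanced or microbalanced definitions: it uses a hypothetical one-pixel bar that steps rightward with an independently random contrast of +1 or -1 on each step, and the same elementary correlator as a stand-in for a standard motion detector. All names are hypothetical.

```python
from itertools import product

def stepping_bar(contrasts):
    """x,t cross-section of a one-pixel bar at x = t with the given contrasts."""
    n = len(contrasts)
    frames = []
    for t, c in enumerate(contrasts):
        row = [0.0] * n
        row[t] = float(c)
        frames.append(row)
    return frames

def net_rightward(frames):
    """Rightward minus leftward output of an elementary correlator."""
    total = 0.0
    for now, nxt in zip(frames, frames[1:]):
        for x in range(len(now) - 1):
            total += now[x] * nxt[x + 1]
            total -= now[x + 1] * nxt[x]
    return total

n_steps = 6
raw_outputs, rectified_outputs = [], []
# Enumerate every equally likely contrast sequence to get exact expectations.
for contrasts in product((-1, 1), repeat=n_steps):
    frames = stepping_bar(contrasts)
    raw_outputs.append(net_rightward(frames))
    rectified_outputs.append(net_rightward([[abs(v) for v in row] for row in frames]))

mean_raw = sum(raw_outputs) / len(raw_outputs)
mean_rectified = sum(rectified_outputs) / len(rectified_outputs)
print(mean_raw, mean_rectified)
```

In this toy case the correlator's expected output is exactly zero for the raw stimulus, although every individual exemplar steps rightward, and it becomes consistently positive once the stimulus is rectified.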

Figure  8f  shows  an  example  of  a  texture  quilt  (Chubb  &  Sperling,  1989a).  To  make  the  overall 
motion  in  such  a  stimulus  accessible  to  motion  analysis  requires  an  initial  stage  of  selective  spatial  filtering 
(texture  grabbing)  followed  by  rectification  and  standard  motion  (or  texture)  analysis.  No  purely  temporal 
transformation,  no  matter  how  complex  and  nonlinear,  can  make  this  motion  (or  texture)  accessible  to 
first-order  analysis  (Chubb  &  Sperling,  1989a).  The  squares  of  the  texture  quilt  are  each  filled  with  their 
unique  carrier  frequencies,  and  these  frequencies  must  first  be  extracted  and  demodulated  to  reveal  the 
larger  pattern. 

Since  the  work  of  Schade  (1952)  and  DeLange  (1954),  first-order  Fourier-based  computations  have 
been  the  cornerstone  of  psychophysical  analysis  (Note  4).  The  examples  of  Figure  8c,d,e  show  the 
limitations  of  first-order  linear  analysis  and  the  necessity  of  postulating  second-order  computations. 
Texture quilts (Fig. 8f) provide a fine tool for studying the spatial properties of second-order motion.

Distance  estimation.  Distance  estimation  experiments  also  yield  evidence  for  two  processing 
systems,  one  Fourier  and  one  rectifying  system,  in  spatial  vision.  In  a  three-bar  distance  estimation 
experiment,  an  observer  must  judge  whether  a  central  line  is  equally  spaced  between  two  flanking  bars.  In 
a two-bar task, the observer judges the distance between two widely separated bars. On a grey background,
for widely separated bars, it matters not whether one of the bars is black and the other white, or whether
both are white or black (Burbeck, 1987). Indeed, when "bars" are defined by patches of high frequency
gratings so that the bars themselves do not differ in average luminance from their surround, distance
judgments  are  as  accurate  as  with  solid  bars. 

That  observers  accurately  judge  the  distance  between  widely  separated  grating  patches  virtually 
guarantees  a  demodulatory  process  in  which  the  grating  patch  is  converted  to  a  solid  patch.  To  solve  the 
distance  task  with  a  first-order  computation,  i.e.,  with  linear  receptive  fields  and  without  demodulation, 
would  involve  horrendous  complications.  The  linear  receptive  fields  needed  for  distance  judgments  are 
dumbbell-shaped  receptive  fields  with  one  end  of  the  dumbbell  in  each  patch.  Receptive  fields  would  be 
phase  sensitive  with  responses  that  varied  from  negative  to  zero  to  positive  depending  on  just  where  in  the 
receptive  field  the  stimulus  patches  fell.  Receptive  fields  would  have  to  be  duplicated  for  all  orientations, 
distances,  and  pairs  of  frequencies.  Otherwise,  for  example,  distance  judgments  would  be  impossible  if  the 
two  bars  being  judged  were  of  different  spatial  frequencies.  In  fact,  the  distance  between  two  grating 
patches  of  different  spatial  frequencies  is  judged  as  accurately  as  the  other  distances  (Burbeck,  1988). 
Demodulation  resolves  all  these  problems  of  first-order  computations  at  once  by  transposing  distance 
judgments  to  the  lowest  common  domain. 
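
How demodulation reduces the distance task to a single low-frequency computation can be sketched in one dimension. The example below is a hypothetical illustration, not a model of Burbeck's experiments: the patch positions, widths, grating periods, boxcar lowpass filter, and centroid rule are all assumptions, as are the function names. Two grating patches with different carrier frequencies and zero mean contrast are rectified and lowpass filtered, and the distance is then read from the centroids of the resulting solid patches.

```python
import math

def boxcar_lowpass(signal, width):
    """Crude lowpass filter: running mean over `width` samples (clipped at the ends)."""
    out = []
    for i in range(len(signal)):
        lo = max(0, i - width // 2)
        hi = min(len(signal), i - width // 2 + width)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def grating_patches(n, centers, periods, half_width):
    """Zero-mean grating patches on a gray (0) background, one period per patch."""
    signal = [0.0] * n
    for center, period in zip(centers, periods):
        for i in range(center - half_width, center + half_width):
            signal[i] = math.sin(2 * math.pi * (i - center) / period)
    return signal

def centroid(values, lo, hi):
    """Center of mass of values[lo:hi] along the position axis."""
    total = sum(values[lo:hi])
    return sum(i * values[i] for i in range(lo, hi)) / total

n = 600
stimulus = grating_patches(n, centers=(150, 450), periods=(8, 20), half_width=40)

# Demodulate: fullwave rectify, then lowpass to turn each patch into a solid blob.
demodulated = boxcar_lowpass([abs(v) for v in stimulus], 40)

estimated_distance = centroid(demodulated, n // 2, n) - centroid(demodulated, 0, n // 2)
print(estimated_distance)   # close to the true separation of 300 samples
print(sum(stimulus))        # close to 0: the patches match the surround on average
```

The same centroid computation serves both patches even though their carrier frequencies differ, which is the sense in which demodulation transposes the judgment to a common low-frequency domain.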

For closely spaced bars, there is a significant difference between the same-contrast and opposite-contrast
patterns. Thus there again is the telltale rectification-times-size interaction that indicates two
processing  regimes:  rectification  dominating  at  large  retinal  sizes,  and  direct  computation  at  small  retinal 
sizes. 

Klein and Levi (1985) were led to propose two processing regimes based on a size-times-stimulus-type
interaction in a bisection type of distance judgment. The observer’s task was to estimate which of two
horizontal flanking lines was closer to a central line. The flanking lines were either directly above and
below the central line or displaced sideways. With large retinal images, the sideways displacement was
immaterial; with small retinal images, it was critical. This difference between results at close spacings and
far  spacings  of  lines  in  psychophysical  judgments  led  Klein  and  Levi  (1985)  to  postulate  two  regimes  of 
detection  mechanisms.  They  proposed  a  regime  for  small-size  computations  that  relied  on  efficient  linear 
filters (direct computation), but they did not propose a specific regime for large-size computations.
However,  the  failure  of  the  first-order  small-size  computation  to  account  for  the  large-size  results  is 
consistent  with  the  rectification  proposal,  although  Klein  and  Levi’s  results  do  not  specifically  require 
rectification. 

Two  Processing  Regimes:  Conclusions. 

For  bandpass  filtered  objects,  different  computations  will  be  carried  out  depending  on  whether  the 
object  can  be  coded  by  neighboring  sensors  or  whether  it  requires  the  coordination  of  information  from 
distantly separated sensors. Nearest neighbor computations can use linear filters and can be highly efficient
(first-order computations). Distant computations require demodulation (which is carried out by fullwave
rectification) and information that is coordinated at higher levels of the computational hierarchy
(second-order computations). The second-order computations, because they use rectification for demodulation,
sacrifice  statistical  efficiency  (impaired  compared  to  ideal  detectors)  for  computational  simplicity 
(improved  relative  to  attempting  the  computation  at  the  same  hierarchical  level).  An  interesting  unresolved 
issue  that  would  relate  the  stages  and  systems  of  this  paper  concerns  the  extent  to  which  the  noise  sources 
associated  with  each  type  of  computation  (first-order,  second-order)  are  shared  or  independent. 

It  seems  obvious  that  counting  and  labeling  (rectification)  operations  will  predominate  over  linear 
processes  at  higher  perceptual  and  cognitive  levels  of  processing.  The  surprise  has  been  that  simple 
rectification  occurs  so  early  in  processing,  being  involved  in  retinal  gain  control  and  in  the  earliest  stages  of 
motion  and  pattern  analysis.  Presumably  the  appearance  of  rectification  early  in  visual  processing  is 
determined  by  two  factors:  its  economy  of  neural  connectivity  in  a  hierarchically  organized  nervous  system 
and its ecological adequacy in our natural environment.

Reference Notes

Note  1.  Informal  observations.  Variations  in  contrast  and  luminance  were  not  reported  in  Parish  & 
Sperling  (1987a). 

Note 2. The nomenclature was suggested by P. Cavanagh (1988) in "Motion: The long and the short of
it."  Paper  at  the  Workshop  on  Visual  Form  and  Motion  Perception:  Psychophysics,  Computation,  and 
Neural  Networks.  Boston  University,  March  5,  1988. 

Note  3.  Although  it  is  embedded  in  a  much  more  complex  framework,  Grossberg  and  Mingolla  (1985) 
incorporate a fullwave rectifying stage in their general model of texture and boundary perception that
appears to deal with second-order stimuli. However, rectification and similar nonlinear operations such
as squaring do not, in and of themselves, imply second-order processing. For example, Adelson
and Bergen’s (1985) detector of directional motion energy is equivalent to the Reichardt motion model
(van Santen & Sperling, 1984, 1985) and to Knutsson & Granlund’s (1983) texture-orientation model.
All of these models embody a nonlinear squaring stage (or the equivalent), yet they merely perform
first-order  computations;  none  can  detect  the  second-order  motion  or  orientation. 

Note  4.  Ives  (1922)  anticipated  subsequent  linear  theories  of  visual  threshold  phenomena  but  he  was 
ignored  by  the  psychophysicists  of  his  time  because  they  did  not  understand  linear  systems  theory. 

Figure Captions

Figure  1.  Upper:  A  sample  of  the  letter  G  filtered  in  five  spatial  frequency  bands.  The  number  above 
the  band  indicates  the  2D  mean  frequency  (cycles  per  letter  height)  of  the  approximately  two-octave 
wide band. Lower: The filtered letter plus noise in the same bands with a signal-to-noise ratio of 0.50
in all panels. The effective s/n in the reproduction is somewhat lower. (From Parish and Sperling,
1987a.)

Figure  2.  Three  stages  of  visual  processing.  Suns  indicate  stationary  noise  sources,  +  indicates 
summation  components,  boxes  indicate  linear  filters,  triangles  indicate  amplifiers,  and  the  blocked 
triangle  indicates  a  rectifier.  Double  boxes  show  input/output  relations.  Adaptation.  The  visual  input 
is u, the light-adapted output is v. The center/surround organization is produced by two pairs of
separable spatial and temporal filters, F1, F2 respectively, whose impulse responses are indicated in
boxes x and t. The surround controls the gain of amplifier K1 to produce Weber law light adaptation
(Sperling & Sondhi, 1968). The dark-adapted impulse responses are shown by the dark lines v_x versus
x and v_t versus t. The light lines show the light-adapted input/output relation. Dark noise adds
directly to the input; sensory noise adds after light adaptation. Contrast-gain control. The graphs in
box F3 indicate various oriented and nonoriented spatial filters that operate on the light-adapted signal
v. The fullwave rectified outputs, indicated by the blocked triangle R, control the gain of the
corresponding amplifiers K3; only one rectifier and amplifier is shown. A typical input-output function
w vs v is shown in the insert. Postsensory stages. The first-order gain-controlled signal w1 and the
second-order gain-controlling signals w2 are combined with each other and with noise. The
postsensory, decision, mnemonic, and response processes are not detailed. The overall system output
is z.

Figure 3. The contrast modulation transfer function (MTF) and the frequency ranges of the
letter-in-noise stimuli of the Parish & Sperling letter discrimination experiments. The MTF gives the contrast
detection threshold (in percent contrast modulation) for sine gratings as a function of their retinal
spatial frequency; it is based on data of van Nes and Bouman, 1967. Stippling indicates the area, near
threshold, where quantal noise and additive sensory noise predominate. Noise also predominates in
the whole upper portion of the graph, where stimuli are invisible. Each open, downward facing
rectangle indicates the approximate half-bandwidth of one of the frequency bands (1,...,5; Fig. 1) at one of the
extreme viewing distances a,d used by Parish and Sperling (1987a). The horizontal placement of the
corresponding number/letter symbol indicates the mean retinal frequency of the stimulus; symbols
(b,c) for intermediate viewing distances also are shown. The stimulus symbols are placed vertically at
a contrast > 10% to indicate that for all larger contrasts (downward in the figure), performance is
independent of contrast, i.e., it is controlled by multiplicative noise.

Figure 4. When is it impossible to see both the forest and the trees? The retinal image of a bandpass
filtered boundary can be kept independent of the filter frequency ω by varying the viewing distance d.
For two boundaries that are physically separated by a distance φ, the retinal distance r is φ/d.
Example: (a) Impulse response of f_ω, a family of linear spatial filters that differ only in their scale ω
(e.g., ω is the center frequency of the passband). (b) A retinal illuminance distribution l(x)
representing the left and right boundaries of an ideal white stripe. (c) The retinal illuminance
distribution of the filtered image l * f_ω, with viewing distance d chosen so that d = 1/ω. This choice of
d normalizes (leaves unchanged) the images of the boundaries (neighborhoods of dashed lines in b and
c). Only r, and not the boundary images, varies with ω. (d) Ultimately, for very large ω, r grossly
exceeds 360 degrees. When it is necessary to view the boundary from extremely close in order to
achieve visible detail, it will then be impossible simultaneously to resolve a boundary (see a tree) and
to see both boundaries (see the forest).
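
The arithmetic behind this caption can be sketched in a few lines. The sketch reads the caption's relation as r = phi/d, where phi is the physical separation of the two boundaries and d is the viewing distance, and takes d = 1/omega for a filter of scale omega. The function name, the normalized units, and the linear extrapolation of the small-angle relation to large angles are assumptions made for illustration; they are not taken from the text.

```python
import math

def retinal_separation(phi, omega):
    """Return (d, r_degrees) for two boundaries a physical distance phi apart.

    Assumptions (not from the source): phi and omega are in reciprocal,
    normalized units so that d = 1/omega is a distance, and the small-angle
    relation r = phi / d (radians) is extrapolated linearly to large angles.
    """
    d = 1.0 / omega    # viewing distance that leaves the boundary image unchanged
    r = phi / d        # retinal separation in radians, equal to phi * omega
    return d, math.degrees(r)

# As the filter scale grows, the viewer must move closer and r grows in proportion.
for omega in (0.1, 1.0, 10.0):
    d, r_deg = retinal_separation(phi=1.0, omega=omega)
    print(f"omega={omega:5.1f}  d={d:6.2f}  r={r_deg:8.2f} deg")
```

Under these assumptions r is simply phi times omega, so each tenfold increase in filter frequency multiplies the retinal separation of the two boundaries by ten while the image of each boundary stays the same.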

Figure 5. The letter T and receptive fields that have a center frequency of 1 cycle per letter height
(in their higher-frequency dimension). (a) The letter T centered in an even-symmetric receptive field.
The + and - signs indicate the sign of the field's response to spots of light in the indicated areas. (b)
Horizontal cross section showing the sensitivity of the receptive field as a function of position. (c, d)
The letter T within an odd-symmetric receptive field.

Figure 6. Carrier frequencies, amplitude modulation, and demodulation. (a) A carrier frequency
C(x) = sin(2π f_c x). (b) A signal S(x) that consists of an object A which has a large amount of the
carrier f_c (one of its characteristic spatial texture frequencies), a second object B which has an
intermediate amount, and the background which has a small amount. (c) A representation of the actual
frequencies in the image, the image luminance distribution: L(x) = S(x)C(x), an amplitude
modulated carrier. (In visual scenes, the phase of the carrier would not be preserved across objects.)
(d) The rectified image. The absolute value of the image |S(x)C(x)| is the simplest instantiation of
fullwave rectification. (e) A lowpass filter (Normal density function). (f) The result of lowpass
filtering (d), LP * |S(x)C(x)|. The original signal S(x) has been mostly recovered.
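
The modulation and demodulation steps of this caption can be sketched numerically. This is a minimal illustration, not the procedure used to draw the figure: the envelope values for object A, object B, and the background, the carrier frequency, the sampling grid, and the width of the Normal-density kernel are all hypothetical choices, and the function name `demodulate` is introduced only for this example.

```python
import numpy as np

def demodulate(signal, carrier_freq, x, sigma):
    """Amplitude-modulate `signal`, fullwave rectify, then lowpass filter."""
    carrier = np.sin(2 * np.pi * carrier_freq * x)   # (a) carrier C(x)
    image = signal * carrier                         # (c) L(x) = S(x) C(x)
    rectified = np.abs(image)                        # (d) fullwave rectification
    # (e) lowpass filter: sampled Normal density, normalized to unit sum
    dx = x[1] - x[0]
    half = int(np.ceil(4 * sigma / dx))
    k = np.arange(-half, half + 1) * dx
    kernel = np.exp(-0.5 * (k / sigma) ** 2)
    kernel /= kernel.sum()
    recovered = np.convolve(rectified, kernel, mode="same")  # (f) LP * |S C|
    return image, rectified, recovered

x = np.linspace(0.0, 10.0, 2001)
# (b) hypothetical envelope: background 0.1, object A 1.0, object B 0.5
S = np.full_like(x, 0.1)
S[(x > 2) & (x < 4)] = 1.0
S[(x > 6) & (x < 8)] = 0.5

image, rectified, recovered = demodulate(S, carrier_freq=10.0, x=x, sigma=0.1)
print("correlation of recovered output with S(x):", np.corrcoef(S, recovered)[0, 1])
```

In this sketch the recovered output follows S(x) up to a constant factor close to 2/π, the mean of a rectified sinusoid, with some blurring at the object boundaries. That is one concrete sense in which the original signal is "mostly recovered."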

Figure  7.  How  linear  transformations,  fullwave  rectification,  and  half-wave  rectification  can  be 
accomplished  in  the  visual  system.  On  System  refers  to  neurons  that  have  an  on-center/off-surround 
receptive  field  organization  (Kuffler,  1953)  and  which  carry  signals  representing  positive  local 
contrasts  relative  to  the  surround.  Off  System  refers  to  neurons  that  have  off-center/on-surround 
receptive fields, and which transmit information about negative local contrasts. (a) When synapses
from an On-System neuron onto a target neuron are excitatory and Off-System synapses are inhibitory
(indicated by the inverting amplifier -1), the sign of input contrasts is preserved and first-order
(Fourier) motion analysis of the stimulus can occur. (b) Fullwave rectification occurs when both On-
and  Off-System  synapses  are  the  same  (either  excitatory  or  inhibitory);  this  results  in  second-order 
signal  analysis  that  is  "nonFourier."  (c)  Positive  halfwave  rectification  occurs  when  the  On-System 
signals  are  analyzed  independently;  negative  halfwave  rectification  refers  to  independent  analysis  of 
Off-System  signals.  Like  fullwave  rectification,  halfwave  rectification  is  a  second-order  processing 
scheme. 
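
The three wiring schemes of this caption can be written as a short sketch. It assumes, for illustration only, that the On System carries the positive part of a signed local contrast and the Off System carries the magnitude of the negative part; the function names are hypothetical and no claim is made about actual neural response functions.

```python
import numpy as np

def on_off(contrast):
    """Split signed local contrast into non-negative On and Off signals."""
    on = np.maximum(contrast, 0.0)     # On System: positive contrasts
    off = np.maximum(-contrast, 0.0)   # Off System: negative contrasts
    return on, off

def first_order(contrast):
    """(a) On excitatory, Off inhibitory: the sign of contrast is preserved."""
    on, off = on_off(contrast)
    return on - off

def fullwave(contrast):
    """(b) On and Off synapses of the same sign: the absolute value."""
    on, off = on_off(contrast)
    return on + off

def halfwave(contrast, positive=True):
    """(c) On signals alone (positive) or Off signals alone (negative)."""
    on, off = on_off(contrast)
    return on if positive else off

c = np.array([-2.0, -0.5, 0.0, 1.5])
print(first_order(c), fullwave(c), halfwave(c), halfwave(c, positive=False))
```

The only difference between the first-order and fullwave paths in this sketch is the sign given to the Off signal at the target neuron, which is the point the caption makes about excitatory versus inhibitory synapses.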

Figure 8. Stimuli for analyzing second-order processing. (a) An x,y,t representation of successive
frames of a motion stimulus: a black bar moving rightward. (b) An x,t cross-section of (a). A
sinewave grating, representing a dominant Fourier component, has been superimposed on the x,t
cross-section. Note that the detection of direction of motion in x,t is equivalent to the detection of
direction of slant in x,y. (c) An x,t cross-section of a windowed, contrast-reversing bar, a stimulus
that appears to move leftward from afar (first-order motion) and rightward from near (second-order
motion). A sinewave grating, representing a dominant Fourier component, has been superimposed on
the x,t cross-section to indicate the direction of Fourier movement. (d, e) x,t cross-sections of
microbalanced stimuli whose motion is invisible to first-order motion detectors and whose slant in
their x,y representation is invisible to first-order orientation detectors (e.g., Hubel-Wiesel cells). (f) A
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texture  quilt.  The  four  rows  represent  four  successive  frames  of  a  dynamic  stimulus.  The  initial 
extraction  of  either  the  low  spatial-frequency  texture  oriented  downward  left  or  of  the  high  frequency 
texture  oriented  downward  right  will  enable  a  first-order  motion  algorithm  to  extract  the  overall 
leftward  motion  (overall  slant  downward  to  the  left).  Texture  quilts  remain  microbalanced  after  any 
purely temporal transformation and require an initial texture extraction followed by rectification to
expose their motion in x,t (or orientation in x,y) to standard analysis.
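
The opposition between first-order and second-order motion in a contrast-reversing bar can be checked with a small Fourier computation. The sketch below is an idealization under stated assumptions: the bar is one sample wide, steps rightward by one sample per frame, wraps around periodically instead of being windowed, and is drawn with amplitude +1 or -1 instead of as a black bar on a background. The function name and the stimulus size are hypothetical. Fourier power is sorted by the sign of the product of spatial and temporal frequency, which under NumPy's FFT sign convention separates rightward from leftward drift.

```python
import numpy as np

def directional_energy(xt):
    """Return (leftward, rightward) Fourier power of a stimulus xt[t, x].

    With NumPy's FFT convention a pattern drifting rightward (x increasing
    with t) has power where the spatial and temporal frequencies have
    opposite signs. Zero and Nyquist frequencies are direction-ambiguous
    and are excluded.
    """
    power = np.abs(np.fft.fft2(xt)) ** 2
    ft = np.fft.fftfreq(xt.shape[0])[:, None]   # temporal frequency (rows)
    fx = np.fft.fftfreq(xt.shape[1])[None, :]   # spatial frequency (columns)
    valid = (np.abs(fx) < 0.5) & (np.abs(ft) < 0.5)
    leftward = power[valid & (fx * ft > 0)].sum()
    rightward = power[valid & (fx * ft < 0)].sum()
    return leftward, rightward

n = 16
frames = np.arange(n)
bar = np.zeros((n, n))
bar[frames, frames] = 1.0                            # bar at x = t, stepping rightward
reversing = bar * ((-1.0) ** frames)[:, None]        # contrast reverses every frame

print("plain bar (left, right):       ", directional_energy(bar))
print("contrast-reversing (left, right):", directional_energy(reversing))
print("after fullwave rectification:  ", directional_energy(np.abs(reversing)))
```

In this idealized case the Fourier power of the contrast-reversing bar lies in the leftward quadrants even though the bar steps rightward, and fullwave rectification restores the rightward power.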
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NOTES 

Sampling really is multiplicative noise - see Burgess-Kersten-Legge 1987
(DONE) 


Want  high  intensity  stimuli  to  have  a  uniform  freq  spectrum  so  that 
retinal  magnif  and  minific  leave  spectral  relations  unaltered 


bar:  Fig  2  still  is  under  explained 


Mention Prazdny
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